{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {
      "byteLimit": 2048000,
      "rowLimit": 10000
     },
     "inputWidgets": {},
     "nuid": "df33c614-2ecc-47a0-8600-bc891681997f",
     "showTitle": false,
     "title": ""
    }
   },
   "source": [
    "## Welcome to the Profiling Tool for the RAPIDS Accelerator for Apache Spark\n",
    "\n",
    "To run the profiling tool, enter the log path that represents the DBFS location of your Spark GPU event logs. Then, select \"Run all\" to execute the notebook. Once the notebook completes, various output tables will appear below. For more options on running the profiling tool, please refer to the [Profiling Tool User Guide](https://docs.nvidia.com/spark-rapids/user-guide/latest/profiling/quickstart.html#running-the-tool).\n",
    "\n",
    "### Note\n",
    "- Currently, local, S3 or DBFS event log paths are supported.\n",
    "- S3 path is only supported on Databricks AWS using [instance profiles](https://docs.databricks.com/en/connect/storage/tutorial-s3-instance-profile.html).\n",
    "- Eventlog path must follow the formats `/dbfs/path/to/eventlog` or `dbfs:/path/to/eventlog` for logs stored in DBFS.\n",
    "- Use wildcards for nested lookup of eventlogs. \n",
    "   - For example: `/dbfs/path/to/clusterlogs/*/*`\n",
    "- Multiple event logs must be comma-separated. \n",
    "   - For example: `/dbfs/path/to/eventlog1,/dbfs/path/to/eventlog2`\n",
    "\n",
    "### Per-Job Profile\n",
    "\n",
    "The profiler output includes information about the application, data sources, executors, SQL stages, Spark properties, and key application metrics at the job and stage levels."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {
      "byteLimit": 2048000,
      "rowLimit": 10000
     },
     "inputWidgets": {},
     "nuid": "5e9f5796-46ed-49ac-9d08-c8b98a87c39d",
     "showTitle": true,
     "title": "Set Tools Version"
    },
    "jupyter": {
     "source_hidden": true
    }
   },
   "outputs": [],
   "source": [
    "TOOLS_VER = \"25.04.0\"\n",
    "print(f\"Using Tools Version: {TOOLS_VER}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {
      "byteLimit": 2048000,
      "rowLimit": 10000
     },
     "inputWidgets": {},
     "nuid": "313ee58b-61b3-4010-9d60-d21eceea796c",
     "showTitle": true,
     "title": "Install Package"
    }
   },
   "outputs": [],
   "source": [
    "%pip install spark-rapids-user-tools==$TOOLS_VER > /dev/null"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {
      "byteLimit": 2048000,
      "rowLimit": 10000
     },
     "inputWidgets": {},
     "nuid": "34492d18-1130-45be-b9f7-e6931d3fa66b",
     "showTitle": true,
     "title": "Environment Setup"
    },
    "jupyter": {
     "source_hidden": true
    }
   },
   "outputs": [],
   "source": [
    "import os\n",
    "import pandas as pd\n",
    "\n",
    "\n",
    "def convert_dbfs_path(path):\n",
    "    return path.replace(\"dbfs:/\", \"/dbfs/\")\n",
    "  \n",
    "# Detect cloud provider from cluster usage tags\n",
    "valid_csps = [\"aws\", \"azure\"]\n",
    "CSP=spark.conf.get(\"spark.databricks.clusterUsageTags.cloudProvider\", \"\").lower()\n",
    "if CSP not in valid_csps:\n",
    "    print(f\"ERROR: Cannot detect cloud provider from cluster usage tags. Using '{valid_csps[0]}' as default. \")\n",
    "    CSP = valid_csps[0]\n",
    "else:\n",
    "    print(f\"Detected Cloud Provider from Spark Configs: '{CSP}'\")\n",
    "\n",
    "# Initialize variables from widgets\n",
    "dbutils.widgets.text(\"Eventlog Path\", \"/dbfs/user1/profiling_logs\")\n",
    "EVENTLOG_PATH=dbutils.widgets.get(\"Eventlog Path\")\n",
    "EVENTLOG_PATH=convert_dbfs_path(EVENTLOG_PATH)\n",
    "\n",
    "dbutils.widgets.text(\"Output Path\", \"/tmp\")\n",
    "OUTPUT_PATH=dbutils.widgets.get(\"Output Path\")\n",
    "\n",
    "# Setup environment variables\n",
    "os.environ[\"CSP\"] = CSP\n",
    "os.environ[\"EVENTLOG_PATH\"] = EVENTLOG_PATH\n",
    "os.environ[\"OUTPUT_PATH\"] = OUTPUT_PATH\n",
    "\n",
    "# Setup console output file\n",
    "CONSOLE_OUTPUT_PATH = os.path.join(OUTPUT_PATH, 'console_output.log')\n",
    "CONSOLE_ERROR_PATH = os.path.join(OUTPUT_PATH, 'console_error.log')\n",
    "os.environ['CONSOLE_OUTPUT_PATH'] = CONSOLE_OUTPUT_PATH\n",
    "os.environ['CONSOLE_ERROR_PATH'] = CONSOLE_ERROR_PATH\n",
    "print(f'Console output will be stored at {CONSOLE_OUTPUT_PATH} and errors will be stored at {CONSOLE_ERROR_PATH}')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {
      "byteLimit": 2048000,
      "rowLimit": 10000
     },
     "inputWidgets": {},
     "nuid": "693b5ee0-7500-43f3-b3e2-717fd5468aa8",
     "showTitle": true,
     "title": "Run Profiling Tool"
    }
   },
   "outputs": [],
   "source": [
    "%sh\n",
    "spark_rapids profiling --platform databricks-$CSP --eventlogs \"$EVENTLOG_PATH\" -o \"$OUTPUT_PATH\" --verbose > \"$CONSOLE_OUTPUT_PATH\" 2> \"$CONSOLE_ERROR_PATH\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {
      "byteLimit": 2048000,
      "rowLimit": 10000
     },
     "inputWidgets": {},
     "nuid": "f83af6c8-5a79-4a46-965b-38a4cb621877",
     "showTitle": false,
     "title": ""
    }
   },
   "source": [
    "## Console Output\n",
    "Console output shows the recommended configurations for each app\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {
      "byteLimit": 2048000,
      "rowLimit": 10000
     },
     "inputWidgets": {},
     "nuid": "c61527b7-a21a-492c-bab8-77f83dc5cabf",
     "showTitle": true,
     "title": "Show Console Output"
    },
    "jupyter": {
     "source_hidden": true
    }
   },
   "outputs": [],
   "source": [
    "%sh\n",
    "cat $CONSOLE_OUTPUT_PATH"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {
      "byteLimit": 2048000,
      "rowLimit": 10000
     },
     "inputWidgets": {},
     "nuid": "f3c68b28-fc62-40ae-8528-799f3fc7507e",
     "showTitle": true,
     "title": "Show Logs"
    },
    "jupyter": {
     "source_hidden": true
    }
   },
   "outputs": [],
   "source": [
    "%sh\n",
    "cat $CONSOLE_ERROR_PATH"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {
      "byteLimit": 2048000,
      "rowLimit": 10000
     },
     "inputWidgets": {},
     "nuid": "05f96ca1-1b08-494c-a12b-7e6cc3dcc546",
     "showTitle": true,
     "title": "Parse Output"
    },
    "jupyter": {
     "source_hidden": true
    }
   },
   "outputs": [],
   "source": [
    "import re\n",
    "import shutil\n",
    "import os\n",
    "\n",
    "def extract_file_info(console_output_path, output_base_path):\n",
    "    try:\n",
    "        with open(console_output_path, 'r') as file:\n",
    "            stdout_text = file.read()\n",
    "        \n",
    "        # Extract log file location\n",
    "        location_match = re.search(r\"Location: (.+)\", stdout_text)\n",
    "        if not location_match:\n",
    "            raise ValueError(\"Log file location not found in the provided text.\")\n",
    "        \n",
    "        log_file_location = location_match.group(1)\n",
    "        \n",
    "        # Extract profiling output folder\n",
    "        prof_match = re.search(r\"prof_[^/]+(?=\\.log)\", log_file_location)\n",
    "        if not prof_match:\n",
    "            raise ValueError(\"Output folder not found in the log file location.\")\n",
    "        \n",
    "        output_folder_name = prof_match.group(0)\n",
    "        output_folder = os.path.join(output_base_path, output_folder_name)\n",
    "        return output_folder, log_file_location\n",
    "    \n",
    "    except Exception as e:\n",
    "        raise RuntimeError(f\"Cannot parse console output. Reason: {e}\")\n",
    "\n",
    "def copy_logs(destination_folder, *log_files):\n",
    "    try:\n",
    "        log_folder = os.path.join(destination_folder, \"logs\")\n",
    "        os.makedirs(log_folder, exist_ok=True)\n",
    "        \n",
    "        for log_file in log_files:\n",
    "            if os.path.exists(log_file):\n",
    "                shutil.copy2(log_file, log_folder)\n",
    "            else:\n",
    "                print(f\"Log file not found: {log_file}\")\n",
    "    except Exception as e:\n",
    "        raise RuntimeError(f\"Cannot copy logs to output. Reason: {e}\")\n",
    "\n",
    "try:\n",
    "    output_folder, log_file_location = extract_file_info(CONSOLE_OUTPUT_PATH, OUTPUT_PATH)\n",
    "    print(f\"Output folder detected {output_folder}\")\n",
    "    copy_logs(output_folder, log_file_location, CONSOLE_OUTPUT_PATH, CONSOLE_ERROR_PATH)\n",
    "    print(f\"Logs successfully copied to {output_folder}\")\n",
    "except Exception as e:\n",
    "    print(e)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {
      "byteLimit": 2048000,
      "rowLimit": 10000
     },
     "inputWidgets": {},
     "nuid": "8c65adcd-a933-482e-a50b-d40fa8f50e16",
     "showTitle": true,
     "title": "Download Output"
    },
    "jupyter": {
     "source_hidden": true
    }
   },
   "outputs": [],
   "source": [
    "import shutil\n",
    "import os\n",
    "import re\n",
    "\n",
    "current_working_directory = os.getcwd()\n",
    "\n",
    "def create_destination_folders(folder_name):\n",
    "    os.makedirs(folder_name, exist_ok=True)\n",
    "    base_download_folder_path = os.path.join(\"/dbfs/FileStore/\", folder_name)\n",
    "    os.makedirs(base_download_folder_path, exist_ok=True) \n",
    "    return base_download_folder_path\n",
    "\n",
    "def create_download_link(source_folder, destination_folder_name):\n",
    "    folder_to_compress = os.path.basename(source_folder)\n",
    "    zip_file_name = folder_to_compress + '.zip'\n",
    "    local_zip_file_path = os.path.join(current_working_directory, destination_folder_name, zip_file_name)\n",
    "    download_folder_path = os.path.join(destination_folder_name, zip_file_name)\n",
    "    try:\n",
    "        base_download_folder_path = create_destination_folders(destination_folder_name)\n",
    "        shutil.make_archive(folder_to_compress, 'zip', source_folder)\n",
    "        shutil.copy2(zip_file_name, base_download_folder_path)\n",
    "        if os.path.exists(local_zip_file_path):\n",
    "            os.remove(local_zip_file_path)\n",
    "        shutil.move(zip_file_name, local_zip_file_path)\n",
    "    \n",
    "        download_button_html = f\"\"\"\n",
    "        <style>\n",
    "            .download-btn {{\n",
    "                display: inline-block;\n",
    "                padding: 10px 20px;\n",
    "                font-size: 16px;\n",
    "                color: white;\n",
    "                background-color: #4CAF50;\n",
    "                text-align: center;\n",
    "                text-decoration: none;\n",
    "                border-radius: 5px;\n",
    "                border: none;\n",
    "                cursor: pointer;\n",
    "                margin: 15px auto;\n",
    "            }}\n",
    "            .download-btn:hover {{\n",
    "                background-color: #45a049;\n",
    "            }}\n",
    "            .button-container {{\n",
    "                display: flex;\n",
    "                justify-content: center;\n",
    "                align-items: center;\n",
    "            }}\n",
    "        </style>\n",
    "        \n",
    "        <div style=\"color: #444; font-size: 14px; text-align: center; margin: 10px;\">\n",
    "            Zipped output file created at {local_zip_file_path}\n",
    "        </div>\n",
    "        <div class='button-container'>\n",
    "            <a href='/files/{download_folder_path}' class='download-btn'>Download Output</a>\n",
    "        </div>\n",
    "        \"\"\"\n",
    "        displayHTML(download_button_html)\n",
    "    except Exception as e:\n",
    "        error_message_html = f\"\"\"\n",
    "        <div style=\"color: red; text-align: center; margin: 20px;\">\n",
    "            <strong>Error:</strong> Cannot create download link for {source_folder}. Reason: {e}\n",
    "        </div>\n",
    "        \"\"\"\n",
    "        displayHTML(error_message_html)\n",
    "\n",
    "destination_folder_name = \"Tools_Output\"\n",
    "create_download_link(output_folder, destination_folder_name)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {
      "byteLimit": 2048000,
      "rowLimit": 10000
     },
     "inputWidgets": {},
     "nuid": "bbe50fde-0bd6-4281-95fd-6a1ec6f17ab2",
     "showTitle": false,
     "title": ""
    }
   },
   "source": [
    "\n",
    "%md\n",
    "\n",
    "## GPU Job Tuning Recommendations\n",
    "This has general suggestions for tuning your applications to run optimally on GPUs.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {
      "byteLimit": 2048000,
      "rowLimit": 10000
     },
     "inputWidgets": {},
     "nuid": "b8bca4a6-16d8-4b60-ba7b-9aff64bdcaa1",
     "showTitle": true,
     "title": "Show Recommendations"
    },
    "jupyter": {
     "source_hidden": true
    }
   },
   "outputs": [],
   "source": [
    "jar_output_folder = os.path.join(output_folder, \"rapids_4_spark_profile\")\n",
    "app_df = pd.DataFrame(columns=['appId', 'appName'])\n",
    "\n",
    "for x in os.scandir(jar_output_folder):\n",
    "    if x.is_dir():\n",
    "        csv_path = os.path.join(x.path, \"application_information.csv\")\n",
    "        if os.path.exists(csv_path):\n",
    "          tmp_df = pd.read_csv(csv_path)\n",
    "          app_df = pd.concat([app_df, tmp_df[['appId', 'appName']]])\n",
    "\n",
    "\n",
    "app_list = app_df[\"appId\"].tolist()\n",
    "app_recommendations = pd.DataFrame(columns=['app', 'recommendations'])\n",
    "\n",
    "for app in app_list:\n",
    "  app_file = open(os.path.join(jar_output_folder, app, \"profile.log\"))\n",
    "  recommendations_start = 0\n",
    "  recommendations_str = \"\"\n",
    "  for line in app_file:\n",
    "    if recommendations_start == 1:\n",
    "      recommendations_str = recommendations_str + line\n",
    "    if \"### D. Recommended Configuration ###\" in line:\n",
    "      recommendations_start = 1\n",
    "  app_recommendations = pd.concat([app_recommendations, pd.DataFrame({'app': [app], 'recommendations': [recommendations_str]})], ignore_index=True)\n",
    "display(app_recommendations)"
   ]
  }
 ],
 "metadata": {
  "application/vnd.databricks.v1+notebook": {
   "dashboards": [
    {
     "elements": [],
     "globalVars": {},
     "guid": "",
     "layoutOption": {
      "grid": true,
      "stack": true
     },
     "nuid": "91c1bfb2-695a-4e5c-8a25-848a433108dc",
     "origId": 2173122769183713,
     "title": "Executive View",
     "version": "DashboardViewV1",
     "width": 1600
    },
    {
     "elements": [],
     "globalVars": {},
     "guid": "",
     "layoutOption": {
      "grid": true,
      "stack": true
     },
     "nuid": "62243296-4562-4f06-90ac-d7a609f19c16",
     "origId": 2173122769183714,
     "title": "App View",
     "version": "DashboardViewV1",
     "width": 1920
    }
   ],
   "environmentMetadata": null,
   "language": "python",
   "notebookMetadata": {
    "mostRecentlyExecutedCommandWithImplicitDF": {
     "commandId": 2173122769183692,
     "dataframes": [
      "_sqldf"
     ]
    },
    "pythonIndentUnit": 2,
    "widgetLayout": [
     {
      "breakBefore": false,
      "name": "Eventlog Path",
      "width": 778
     },
     {
      "breakBefore": false,
      "name": "Output Path",
      "width": 302
     }
    ]
   },
   "notebookName": "[RAPIDS Accelerator for Apache Spark] Profiling Tool Notebook Template",
   "widgets": {
    "Eventlog Path": {
     "currentValue": "/dbfs/user1/profiling_logs",
     "nuid": "1272501d-5ad9-42be-ab62-35768b2fc384",
     "typedWidgetInfo": null,
     "widgetInfo": {
      "widgetType": "text",
      "defaultValue": "/dbfs/user1/profiling_logs",
      "label": "",
      "name": "Eventlog Path",
      "options": {
       "widgetType": "text",
       "autoCreated": false,
       "validationRegex": null
      }
     }
    },
    "Output Path": {
     "currentValue": "/tmp",
     "nuid": "ab7e082c-1ef9-4912-8fd7-51bf985eb9c1",
     "typedWidgetInfo": null,
     "widgetInfo": {
      "widgetType": "text",
      "defaultValue": "/tmp",
      "label": null,
      "name": "Output Path",
      "options": {
       "widgetType": "text",
       "autoCreated": null,
       "validationRegex": null
      }
     }
    }
   }
  },
  "language_info": {
   "name": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
